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In this article we present our implementation of a Hybrid Monte Carlo algorithm for Lattice 
Gauge Theory using two degenerate flavours of Wilson-Dirac fermions on a Fermi GPU. We find 
that using registers instead of global memory speeds up the code by almost an order of magnitude. 
To map the array variables to scalars, so that the compiler puts them in the registers, we use code 
generators. Our final program is more than 10 times faster than a generic single CPU. 



I. INTRODUCTION 

Lattice gauge theory [l[ is a formulation of gauge the- 
ory on a Euclidean space-time lattice A. A consists of 
iV s ite points n ordered in some fashion and d x iV S ite links 
(n, fi) where d is the number of space-time dimensions. 
This formulation of gauge theory is amenable to com- 
puter simulations. Lattice gauge theory simulations can 
perform non-perturbative computations for static observ- 
ables in Quantum Chromodynamics (QCD). The expec- 
tation value of an observable O can be obtained by eval- 
uating a Euclidean path integral as 

(O) = J [DU]diPidijMOe-M D(u) + m ^- s 9( u ). (i) 

U n ,n £ SU(N C ) is a group valued matrix H defined on 
the link (n, /i) and plays the role of a parallel transporter 
(gauge field), ipi represents a fermionic species with mass 
nii and is defined at each lattice point n. D is a lattice 
Dirac operator and S g is a discretized version of the con- 
tinuum Yang-Mills lagrangian |trF /J!/ F MW '. 

We first integrate out the fermionic fields to obtain 

(0)= f [DU] Y[ det (D ^ + mi) O e~ Ss {u) . (2) 

Computing the determinant is an 0(N 3 ) process for a 
N x N matrix and for the 4-dimensional Dirac oprator 
N = N c x 4 x A^sitc- Even for moderate lattice sizes 
N ~ 10 6 . An explicit evaluation of this determinant is 
not feasible. It is usually re-exponentiated by introducing 
additional bosonic fields <f> called pseudofermions as 

(O) = J [DWlWOe-tKB™-^)- 1 ** ~ s ^ u \ (3) 

The path integral now involves evaluating the inverse of 
the matrix + mA on the vector </>j. This can be 

done in at most O(N) steps using conjugate gradient 
(CG) type of algorithms. 



Though lattice gauge theory simulations are computa- 
tionally expensive, they can be easily parallelized. Such 
simulations can therefore be efficiently carried out on a 
Graphics Processing Units (GPUs) which have about 500 
compute cores. Since the first implementation of lattice 
gauge simulations on GPUs in 2007 [2] there has been 
numerous implementations of various formulations of lat- 
tice gauge theory on GPUs. A collection of GPU subrou- 
tines for several lattice actions has been put together in a 
package called QUDA Q . Implementation of pure gauge 
codes and spin models on GPUs have been reported in 
i and @. 

In this article we explore a new method of implement- 
ing lattice simulations on the GPU using register vari- 
ables and avoid the slow global memory for storing run- 
time variables. We consider one of the most popular 
formulations namely lattice gauge theory with two mass 
degenerate fermions. This partition function can be writ- 
ten as 

Z = J [DU]d4> ] -d4e~*\**fww)+ ~ S<j(C/) . (4) 

where M(U) = £> (l/) 

In section 2 we briefly describe the algorithm which 
we use. Section 3 describes the architecture of the GPU 
cards on which the program was tested and discusses the 
modifications we have to make in the usual CPU program 
for efficient running on the GPU. Finally in section 4 
we report on the performance of the program and our 
conclusions. 



II. THE HMC ALGORITHM 

Currently the most efficient algorithm for generating 
gauge configurations with dynamical quarks on the lat- 
tice is known as Hybrid Monte Carlo (HMC) [6J. In this 
algorithm the dynamics is generated by equations of mo- 
tion corresponding to the Hamiltonian 

W,P) = l*P 2 + s g (u ) + ^ ( Mi( v )M{u) ) 4>- (5) 
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Here p n .,^ £ su(N c ) is an auxiliary momentum variable 
with U and e lp being conjugate to each other, p is drawn 
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from a Gaussian distribution and cj) — M-^ {U)x where x 
is also drawn from a Gaussian distribution. Both p and 
X are periodically refreshed. 

The equations of motion (molecular dynamics equa- 
tions) define a trajectory in the phase space. Integrating 
these equations numerically is the most time consuming 
part in the algorithm accounting for 80 to 90% of the 
computation time depending on simulation parameters. 
This part of the algorithm involves repeated multiplica- 
tion of a matrix M. = (D + m) on the vector <f>. Paral- 
lclization of the matrix vector multiplication speeds up 
lattice gauge theory simulations to a great extent. 

The HMC algorithm thus refreshes the momenta peri- 
odically to ensure ergodicity, uses the Hamiltonian equa- 
tions of motion to move quickly through the phase space 
and perofrms a Monte Carlo accept/reject step to correct 
for the finite time step used in the integration. It is an 
exact algorithm. 



III. FERMI ARCHITECTURE AND 
ORGANIZATION OF THE GPU PROGRAM 

Our program was tested on two GPU cards the C2050 
and X2090 (different versions of the NVIDIA Fermi 
Card) and we briefly discuss their architecture below. 
Both cards have intrinsic double precision support and 
are CUD A compute version 2. 

The computation on the cards are carried out by multi- 
processors each of which have 32 cores and a maximum of 
64 KBytes of local storage (shared memory + LI cache). 
The C2050 has 14 such multiprocessors giving it a total 
of 448 cores while the X2090 has 16 such multiprocessors 
giving it a total of 512 cores. 

Nvidia GPU cards have several different categories of 
memory and their access times vary widely. The C2050 
has a global memory of 3 GB while the X2090 has a global 
memory of 6 GB. These are the largest memories but 
their access times are between 100 and 150 clock cycles. 
Every multiprocessor on the other hand has 32768 4-byte 
registers whose combined memory capacity is 128 Kbytes 
and access to the register variables takes only 1 clock 
cycle. Each thread can have a maximum of 63 registers. 
Fermi cards have the added feature that registers spill 
over to the shared memory which too can be accessed 
in 2 - 3 clock cycles. The combined storage capacity of 
registers and shared memory is about 192 Kbytes per 
multiprocessor or about 2.7 MB for the C2050 and 3 MB 
for the X2090. In our program we have attempted to use 
the registers as much as possible instead of the global 
memory. 

The CUDA compiler usually tries to place all scalar 
variables in the register [7] while arrays are typically 
placed in the global memory. In HMC, the molecu- 
lar dynamics part deals mainly with three large arrays: 
the force which is an array of dimension N c , N c ,4, N s i te 
and the momenta and the pseudo-fermion fields, both of 
which are arrays of dimension iV c , 4, -/V s it c - Our main chal- 



lenge therefore is to make sure that the elements of these 
arrays are placed in the register and not in the global 
memory. 

The structure of the program is roughly the same for 
the CPU and the GPU. There is an initialization step 
where the gauge field is initialized. The gauge field is 
then evolved through a certain number of trajectories. 
At the end of each trajectory measurements are carried 
out on the gauge configuration. 

The evolution of the gauge fields consists of the follow- 
ing steps for each trajectory. 

1. Setting up the pscudofcrmion fields and momenta. 

2. Integrating the molecular dynamics equations of 
motion by the leap-frog method. This sets up the 
proposed configuration. 

3. Accepting or Rejecting the proposed configuration 
depending on H ncw — Hold, where Hold is the Hamil- 
tonian at the beginning of the trajectory and % ucw 
is the Hamiltonian at the end of the trajectory. 

Communication between the CPU and the GPU is 
through a PCI bus. A major bottleneck if one frequently 
transfers data between them. To keep a check on this 
overhead, we tried to strike a balance between the time 
taken for transferring the data and the computational 
time on the GPU. Since all the input and output is car- 
ried out by the CPU we had to transfer the data back to 
the CPU at the end of each trajectory. 

For the computation intensive parts we wrote CUDA 
kernels (enumerated below) which carried out the follow- 
ing computations in parallel on the GPU. 

1. The result of operating the Dirac operator and its 
hermitian conjugate on the pseudofermion field — 
a matrix vector multiplication on a vector of size 
JV c x4x N sitc . 

2. The driving force for the molecular dynamics equa- 
tion due to the gauge fields. 

3. The driving force for the molecular dynamics equa- 
tion due to the fermion fields and evolving the mo- 
menta. 

4. Evolution of the gauge fields. 

We now briefly describe the essential modifications to 
the CPU subroutines for efficient running of the corre- 
sponding GPU kernels. 

We first try to optimize the two matrix- vector multipli- 
cation routines which compute the action of the Dirac op- 
erator and its hermitian conjugate on the pseudofermion 
field as a part of the conjugate gradient routine. So we 
first try to optimize this routine. The Dirac operator 
requires four constant 4x4 matrices (7^) for its defini- 
tion. Only 8 elements of these matrices are non-zero. We 
explicitly work out the result of the matrix-vector mul- 
tiplication for the 7 matrices retaining only the non-zero 
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elements. Thus the loop over the dirac indices is fully un- 
rolled. This however is not special to the CUDA kernels 
and is done also for the CPU subroutines. 

Generally we use the available CUDA BLAS functions 
as much as possible. However for the action of the Dirac 
operator on the pseudofermion field, it is not possible to 
use the BLAS matrix-vector multiplication routines as 
the matrix representing the Dirac operator is too large 
to be stored in either the CPU or the GPU memory. So 
we store the source and the resultant vectors and com- 
pute the necessary matrix elements on the fly. For this 
kernel, the main optimization task was to break down 
the array representing the pseudofermion field into scalar 
variables. Since the pseudofermion field typically consists 
of ~ 10 6 elements, it is impossible to do this by hand. 
We therefore wrote a code generator which runs over the 
necessary loops and yields the equations for the resultant 
vector in terms of independent scalar variables. A nam- 
ing convention we adopt is to name the scalar variables 
by concatenating the array indices with the vector name. 

The second computationally intensive task is comput- 
ing the driving force for the molecular dynamics equa- 
tion due to the fermion fields. On a single thread, they 
typically consume 15% of the trajectory time. After par- 
allelization of the matrix-vector multiplication routines 
on the GPU, this was consuming the largest amount of 
time. To optimize this routine, we followed the same 
strategy as in the matrix-vector multiplication case. Un- 
rolling the dirac loops in the force calculation is slightly 
more involved as they involve factors like ^j^fv 

However 

a bit of algebra again allows us to retain only the non- 
zero elements and with the help of a code generator we 
automatically wrote down the momenta evolution equa- 
tions. On the CPU this routine consumed significantly 
less time compared to the conjugate gradient routine. So 
this optimization was not carried out on the correspond- 
ing CPU subroutines. 

In addition to these optimizations, we also had a rou- 
tine to compute the neighbour indices on the fly to avoid 
storing the neighbour array on the GPU. 

The program is available from the authors on request. 



IV. PERFORMANCE AND DISCUSSIONS 

The speed-up of the GPU program was primarily 
judged by its performance on a C2050 Nvidia GPU card 
running at 1.15 GHz vis a vis its performance on an 
Opteron cluster with QDR Infiniband interconnect. Each 
node of the cluster consisted of two 6-core 2.2 GHz CPUs. 
We also have comparative results for one lattice size, viz. 
24 4 , on a CRAY XE6 and a X2090 Nvidia GPU card 
running at 1.3 GHz on a CRAY XK6 node. 

On the Opteron cluster the highest performance was 
obtained for one MPI process per node. We therefore 
fixed the number of MPI processes to 8 (number of avail- 
able nodes) and varied the number of OpenMP threads 
in each node. For the sake of uniformity, the same test 



Lattice 


GPU 


OMP 


SINGLE 


MPI+OMP 


8 4 


2691.084 


4264.125 


12551.820 


3725.766 


16 4 


37555.429 


84715.792 


409478.538 


74381.869 


24 4 


368173.325 


999445.502 


3717039.929 


376495.740 


28 4 


698531.978 


1567117.214 


7000000.000* 


1242310.046 



TABLE I: Timing (in seconds) of the program on a C2050 
GPU, single CPU, SMP (12 threads) and mixed MPI 
and OMP with 8 MPI processes each having 12 threads 
for different lattice volumes. These runs were performed 
on an Opteron cluster with QDR Infiniband interconnect, 
('estimated) 



Lattice 


GPU 


OMP 


SINGLE 


MPI+OMP 


8 4 


4.66 


2.94 


1 


3.370 


16 4 


10.90 


4.83 


1 


5.505 


24 4 


10.10 


3.72 


1 


9.873 


28 4 


10.02 


4.47 


1 


5.630 



TABLE II: Same as tabled but showing the relative speed-up. 



run pattern was followed even on the Cray. In addition 
to the total time taken, timings for each conjugate gra- 
dient call and fermionic force calculations (the two most 
time consuming subroutines) were written out. 

In table Q] we record the time taken for the various 
lattice volumes using different compute modes of the 
Opteron cluster and the C2050 GPU card mounted in 
the Opteron cluster. These volume scaling runs were per- 
formed at (3 = 5.5 and k — 0.48 with a time step of 0.2 
and 50 integrations per trajectory. The runs were for 200 
trajectories each. The timing for the single CPU run for 
the 28 4 lattice is an estimate based on the V 5 ^ 4 scaling of 
the HMC algorithm, since the actual run was taking far 
too long. In table |TT] we see the speed-up of the different 
computing modes over a single CPU thread. The CPU 
runs were performed using the Intel Fortran Compiler 
with the intel MKL while for the GPU runs we used the 
Nvidia CUDA compiler (nvee). For these runs we found 
that the performance of the C2050 card is roughly equiv- 
alent to the performance of 96 threads of the cluster (for 
lattice size 24 4 ). 

We now report the timing information we have on the 
Cray hardware. Each Cray XE6 node consisted of two 
16 core Opteron CPU at 2.1 GHz. On this machine the 
CPU code was compiled by the Cray fortran compiler 
and Cray libraries were used. The run parameters on 
the Cray were the same as the Opteron cluster. However 
instead of 200, only 2 trajectories were run starting from 
an ordered start. 

We first tested the scaling of this program with the 
Intel Fortan Compiler on the XE6 and found that the 
performance was roughly the same as the Opteron cluster 
upto 64 threads. Beyond 64 threads the performance gain 
continued on the XE6, but not on the Opteron. 

The GPU run on the Cray was done for a different 
CG convergence limit compared to the CPU runs (10 -5 
instead of 10 -8 ) and so we could not directly compare 
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FIG. 1: Timing for the conjugate gradient routine vs number 
of conjugate gradient iterations. Error bars represent the res- 
olution of the timing measurements (1 sec). The intercept of 
the graph gives an estimate of the time taken to launch the 
kernel. 
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FIG. 3: Timing comparison between the Cray XE6 running 
Cray fortran compiler, opteron cluster with Infiniband inter- 
connect running Intel Fortran compiler and the GPU cards 
C2050 and X2090 running nvcc and CUD A. 
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FIG. 2: Timing comparison between the CPU and the GPU 
inverter. Timings are for a CG inversion requiring 123 iter- 
ations. Two different CPU runs viz. Opteron cluster with 
Infiniband interconnect and Intel Fortran Compiler and Cray 
XE6 with Cray Fortran Compiler, are shown along with the 
timings for a C2050 and a X2090 GPU. 



the timing of the X2090 run with the other runs. Never- 
theless, since the CG inversion timing is linear with the 
number of CG iterations for both CPUs (see figure [TJ, 
it was possible to obtain a rough estimate for the timing 
for the X2090 for the same number of CG iterations as 
the other runs. An additional information that we get 
from figure Q] is that the overhead for the CG routine is 
about 0.5 sees for both the C2050 and X2090. 

In figure [2] we compare the inverter timings for the two 
GPUs as well as the Opteron cluster and the Cray XE6 
for a fixed number (123) of CG iterations. The C2050 
gives the performance of 8 Cray threads or 32 cluster 
threads while the X2090 gives a performance of about 20 
Cray threads or 64 cluster threads. 



Figure [3] shows the full timing of the runs for the clus- 
ter, the XE6, the C2050 and the X2090. From this plot 
we conclude that the performance of the C2050 is roughly 
equivalent to about 19 threads of the XE6, while that of 
the X2090 is that of about 26 threads. 

On the Cray XE6 nodes, the force calculation time was 
about 30% of the inversion time with 64 threads, with 
128 threads about 60% of the inversion time and the 256 
thread inverter took 60% less time than the force calcu- 
lation. Therefore, on hindsight, on a lattice of size 24 4 
it is worth unrolling the loops over the force calculations 
as with the inverter beyond 64 threads. On the Opteron 
cluster on the other hand the force calculation time was 
never more than 7% of the inversion time. Thus there 
not much is to be gained by unrolling the force loops. 

Our conclusion is that GPU programs can be speeded 
up to a great extent by using the register variables in- 
stead of the global memory. To effectively use the reg- 
ister variables we had to map array elements to scalar 
variables. This was carried out automatically by a code 
generator. Before doing this mapping the performance 
gain was a modest factor of two. Using the registers 
effectively pushed up the performance gain to a factor 
10 compared to a single thread performance on the lat- 
tice volumes we investigated. Such automated techniques 
might prove useful in the design of accelerator compilers. 

Here we did not use optimization techniques like mixed 
precision solvers or reconstructing the gauge link on the 
fly. We believe these optimizations can bring even more 
gains. 

At the moment we have tested the performance only on 
a single GPU. We hope to report our results on multiple 
GPUs shortly. 
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